1 Introduction

This is an Exploratory Data Analysis for the Google Analytics Customer Revenue Prediction competition, carried out in R. For the EDA itself we will mainly use these packages:

For modelling we will use these packages:

Our task is to build an algorithm that predicts the natural log of the sum of all transactions per user. Thus, for every user in the test set, the target is:

\[y_{user} = \sum_{i=1}^n transaction_{user_i}\]

\[target_{user} = ln(y_{user}+1).\]

Submissions are scored on the root mean squared error, which is defined as:

\[RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \widehat{y_i})^2},\]

where \(\widehat{y_i}\) is the predicted log revenue for a customer and \(y_i\) is the natural log of the customer’s actual summed revenue.
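As a quick sketch of the target transformation and the metric (hypothetical helper functions, not part of the official scoring code):

```r
# RMSE between actual and predicted log-revenue (a minimal helper)
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# Per-user target: natural log of the summed transactions, plus one
target <- function(transactions) log1p(sum(transactions))

rmse(c(0, 1, 2), c(0, 1, 2))  # perfect predictions give 0
```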

Let’s prepare and have a look at the dataset.

2 Preparations

2.1 Load libraries

library(h2o)
library(caret)
library(lme4)
library(ggalluvial)
library(xgboost)
library(jsonlite)
library(lubridate)
library(knitr)
library(Rmisc)
library(scales)
library(countrycode)
library(highcharter)
library(glmnet)
library(keras)
library(forecast)
library(zoo)
library(magrittr)
library(tidyverse)

2.2 Load data

set.seed(0)

tr <- read_csv("./data/train.csv")
te <- read_csv("./data/test.csv")
subm <- read_csv("./data/sample_submission.csv")

3 Peek at the dataset

3.1 General info

Train set file size: NA bytes. Train set dimensions: 903,653 × 12.

Observations: 903,653
Variables: 12
$ channelGrouping        "Organic Search", "Organic Search", "Orga…
$ date                   20160902, 20160902, 20160902, 20160902, 2…
$ device                 "{"browser": "Chrome", "browserVersi…
$ fullVisitorId          "1131660440785968503", "37730602087792789…
$ geoNetwork             "{"continent": "Asia", "subContinent…
$ sessionId              "1131660440785968503_1472830385", "377306…
$ socialEngagementType   "Not Socially Engaged", "Not Socially Eng…
$ totals                 "{"visits": "1", "hits": "1", "p…
$ trafficSource          "{"campaign": "(not set)", "source"…
$ visitId                1472830385, 1472880147, 1472865386, 14728…
$ visitNumber            1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ visitStartTime         1472830385, 1472880147, 1472865386, 14728…

Test set file size: NA bytes. Test set dimensions: 804,684 × 12.

Observations: 804,684
Variables: 12
$ channelGrouping        "Organic Search", "Organic Search", "Orga…
$ date                   20171016, 20171016, 20171016, 20171016, 2…
$ device                 "{"browser": "Chrome", "browserVersi…
$ fullVisitorId          "6167871330617112363", "06436976409779156…
$ geoNetwork             "{"continent": "Asia", "subContinent…
$ sessionId              "6167871330617112363_1508151024", "064369…
$ socialEngagementType   "Not Socially Engaged", "Not Socially Eng…
$ totals                 "{"visits": "1", "hits": "4", "p…
$ trafficSource          "{"campaign": "(not set)", "source"…
$ visitId                1508151024, 1508175522, 1508143220, 15081…
$ visitNumber            2, 1, 1, 1, 1, 1, 1, 1, 6, 1, 1, 1, 1, 1,…
$ visitStartTime         1508151024, 1508175522, 1508143220, 15081…

3.2 Distribution of transaction dates

As shown in the figure, the train set contains only a few transactions after July 2017; the later sessions belong to the test set. It therefore makes sense to create time-based train/validation splits.
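A minimal sketch of such a split, assuming a cutoff date of 2017-07-01 (a similar index is built later for the basic models):

```r
library(lubridate)

# Toy session dates; in the kernel these come from tr$date after ymd() conversion
dates <- ymd(c("20160902", "20170301", "20170615", "20170720"))

# Sessions before the (assumed) cutoff go to the training fold,
# the most recent ones to the validation fold
idx_train <- dates <  ymd("20170701")
idx_valid <- dates >= ymd("20170701")
sum(idx_train)  # 3 of the 4 toy sessions fall before the cutoff
```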

3.3 Dataset columns

There is a total of 12 features:

  • fullVisitorId - a unique identifier for each user of the Google Merchandise Store
  • channelGrouping - the channel via which the user came to the Store
  • date - the date on which the user visited the Store
  • device - the specifications for the device used to access the Store
  • geoNetwork - this section contains information about the geography of the user
  • sessionId - a unique identifier for this visit to the store
  • socialEngagementType - engagement type, either “Socially Engaged” or “Not Socially Engaged”
  • totals - this section contains aggregate values across the session
  • trafficSource - this section contains information about the Traffic Source from which the session originated
  • visitId - an identifier for this session
  • visitNumber - the session number for this user
  • visitStartTime - the timestamp (POSIX).

Let’s have a look at counts of the simple features:

The attribute socialEngagementType can be removed, as it contains only one value.

3.4 JSON data

In fact, the columns device, geoNetwork, trafficSource and totals contain data in JSON format. To parse them we can use the jsonlite package:

flatten_json <- . %>% 
  str_c(., collapse = ",") %>% 
  str_c("[", ., "]") %>% 
  fromJSON(flatten = T)

parse <- . %>% 
  bind_cols(flatten_json(.$device)) %>%
  bind_cols(flatten_json(.$geoNetwork)) %>% 
  bind_cols(flatten_json(.$trafficSource)) %>% 
  bind_cols(flatten_json(.$totals)) %>% 
  select(-device, -geoNetwork, -trafficSource, -totals)
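As a toy illustration of what flatten_json does (sample JSON strings, not real rows):

```r
library(magrittr)
library(stringr)
library(jsonlite)

# Same helper as above: glue the JSON strings into one array and parse it
flatten_json <- . %>% 
  str_c(., collapse = ",") %>% 
  str_c("[", ., "]") %>% 
  fromJSON(flatten = TRUE)

demo <- c('{"browser": "Chrome", "os": "Windows"}',
          '{"browser": "Firefox", "os": "Macintosh"}')
flatten_json(demo)  # a 2-row data.frame with columns browser and os
```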

Let’s convert train and test sets to the tidy format:

tr <- parse(tr)
te <- parse(te)

3.5 Tidy datasets

3.5.1 Train

First two rows of the parsed train set, transposed for readability (the remaining columns contain only "not available in demo dataset" or NA):

Feature                Row 1                            Row 2
channelGrouping        Organic Search                   Organic Search
date                   20160902                         20160902
fullVisitorId          1131660440785968503              377306020877927890
sessionId              1131660440785968503_1472830385   377306020877927890_1472880147
socialEngagementType   Not Socially Engaged             Not Socially Engaged
visitId                1472830385                       1472880147
visitNumber            1                                1
visitStartTime         1472830385                       1472880147
browser                Chrome                           Firefox
operatingSystem        Windows                          Macintosh
isMobile               FALSE                            FALSE
deviceCategory         desktop                          desktop
continent              Asia                             Oceania
subContinent           Western Asia                     Australasia
country                Turkey                           Australia
region                 Izmir                            not available in demo dataset
metro                  (not set)                        not available in demo dataset
city                   Izmir                            not available in demo dataset
networkDomain          ttnet.com.tr                     dodo.net.au
campaign               (not set)                        (not set)
source                 google                           google
medium                 organic                          organic
keyword                (not provided)                   (not provided)
visits                 1                                1
hits                   1                                1
pageviews              1                                1
bounces                1                                1
newVisits              1                                1
transactionRevenue     NA                               NA

3.5.2 Test

First two rows of the parsed test set, transposed for readability (the remaining columns contain only "not available in demo dataset" or NA):

Feature                Row 1                            Row 2
channelGrouping        Organic Search                   Organic Search
date                   20171016                         20171016
fullVisitorId          6167871330617112363              0643697640977915618
sessionId              6167871330617112363_1508151024   0643697640977915618_1508175522
socialEngagementType   Not Socially Engaged             Not Socially Engaged
visitId                1508151024                       1508175522
visitNumber            2                                1
visitStartTime         1508151024                       1508175522
browser                Chrome                           Chrome
operatingSystem        Macintosh                        Windows
isMobile               FALSE                            FALSE
deviceCategory         desktop                          desktop
continent              Asia                             Europe
subContinent           Southeast Asia                   Southern Europe
country                Singapore                        Spain
region                 (not set)                        Aragon
metro                  (not set)                        (not set)
city                   (not set)                        Zaragoza
networkDomain          myrepublic.com.sg                rima-tde.net
campaign               (not set)                        (not set)
source                 google                           google
medium                 organic                          organic
keyword                (not provided)                   (not provided)
isTrueDirect           TRUE                             NA
visits                 1                                1
hits                   4                                5
pageviews              4                                5
newVisits              NA                               1
bounces                NA                               NA

3.5.3 Sample Submission

fullVisitorId PredictedLogRevenue
0000000259678714014 0
0000049363351866189 0
0000053049821714864 0
0000059488412965267 0
0000085840370633780 0

3.6 Train and test features sets intersection

setdiff(names(tr), names(te))
## [1] "campaignCode"       "transactionRevenue"
tr %<>% select(-one_of("campaignCode"))

The test set lacks two columns. One is the target variable transactionRevenue; the other, campaignCode, we remove from the train set.

3.7 Constant columns

Let’s find the constant columns. All these useless features can be safely removed:

fea_uniq_values <- sapply(tr, n_distinct)
(fea_del <- names(fea_uniq_values[fea_uniq_values == 1]))
##  [1] "socialEngagementType"               
##  [2] "browserVersion"                     
##  [3] "browserSize"                        
##  [4] "operatingSystemVersion"             
##  [5] "mobileDeviceBranding"               
##  [6] "mobileDeviceModel"                  
##  [7] "mobileInputSelector"                
##  [8] "mobileDeviceInfo"                   
##  [9] "mobileDeviceMarketingName"          
## [10] "flashVersion"                       
## [11] "language"                           
## [12] "screenColors"                       
## [13] "screenResolution"                   
## [14] "cityId"                             
## [15] "latitude"                           
## [16] "longitude"                          
## [17] "networkLocation"                    
## [18] "adwordsClickInfo.criteriaParameters"
## [19] "visits"
tr %<>% select(-one_of(fea_del))
te %<>% select(-one_of(fea_del))

3.8 Missing values

After parsing the JSON data we can observe many missing values in the data set. Let’s find out how many missing values each feature has, taking into account that values such as “not available in demo dataset”, “(not set)”, “unknown.unknown” and “(not provided)” can be treated as NA.

is_na_val <- function(x) x %in% c("not available in demo dataset", "(not provided)",
                                  "(not set)", "<NA>", "unknown.unknown",  "(none)")

tr %<>% mutate_all(funs(ifelse(is_na_val(.), NA, .)))
te %<>% mutate_all(funs(ifelse(is_na_val(.), NA, .)))

There is a bunch of features missing nearly completely.
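One way to quantify this is the NA share per column; a sketch on a toy data frame (in the kernel one would pass tr):

```r
# Share of missing values per column, most incomplete features first
na_share <- function(df) {
  sort(sapply(df, function(x) mean(is.na(x))), decreasing = TRUE)
}

# Toy data frame standing in for the parsed train set (illustrative values)
demo <- data.frame(adContent = c(NA, NA, NA, "ad"),
                   bounces   = c(1, NA, NA, 4),
                   pageviews = c(1, 2, 3, 4))
na_share(demo)  # adContent 0.75, bounces 0.50, pageviews 0.00
```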

3.9 Simple transformations

tr %<>%
  mutate(date = ymd(date),
         hits = as.integer(hits),
         pageviews = as.integer(pageviews),
         bounces = as.integer(bounces),
         newVisits = as.integer(newVisits),
         transactionRevenue = as.numeric(transactionRevenue))
         
te %<>%
  mutate(date = ymd(date),
         hits = as.integer(hits),
         pageviews = as.integer(pageviews),
         bounces = as.integer(bounces),
         newVisits = as.integer(newVisits))         

3.10 Target variable

As a target variable we use transactionRevenue which is a sub-column of the totals JSON column. It looks like this variable is multiplied by \(10^6\).

y <- tr$transactionRevenue
tr$transactionRevenue <- NULL
summary(y)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.      NA's 
## 1.000e+04 2.493e+07 4.945e+07 1.337e+08 1.077e+08 2.313e+10    892138

We can safely replace NA values with 0.

y[is.na(y)] <- 0
summary(y)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 0.000e+00 0.000e+00 1.704e+06 0.000e+00 2.313e+10

The target variable has a wide range of values. Its distribution is right-skewed. For modelling we will use log-transformed target.

By the way, only 1.27% of all sessions have non-zero revenue:

The figure below shows that users who came via Affiliates and Social channels do not generate revenue. The most profitable channel is Referral:

Also, first visits usually generate more total revenue, though it should be noted that most users have visited the website only once.
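The non-zero share quoted above is simply the proportion of sessions with positive revenue; on a toy vector with a similar proportion (illustrative numbers, not the real data):

```r
# 13 of 1000 toy sessions convert; on the real data this is mean(y > 0)
y_demo <- c(rep(0, 987), rep(5e7, 13))
mean(y_demo > 0)  # 0.013
```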

3.11 How target variable changes in time

The revenue itself can be viewed as a timeseries. There seems to be a pattern of peaks.

There is an interesting separation in target variable by isTrueDirect feature:

3.12 Revenue forecasting

Let’s see if we can predict the log-transformed mean daily revenue as a time series. Here we use the zoo and forecast packages for time-series modelling.

tr %>% 
  bind_cols(tibble(revenue = y)) %>% 
  group_by(date) %>% 
  summarize(mean_revenue = log1p(mean(revenue/1e6))) %>% 
  ungroup() %>% 
  with(zoo(mean_revenue, order.by = date)) ->
  revenue
  
h <- max(te$date) - min(te$date) + 1
  
revenue %>% 
  autoplot() + 
  geom_line() +
  geom_smooth() + 
  labs(x = "", y = "log(revenue)") +
  theme_minimal()

We start with a simple auto.arima model. We need to forecast a period of 272 days.

m_aa <- auto.arima(revenue)
summary(m_aa)
## Series: revenue 
## ARIMA(4,0,3) with non-zero mean 
## 
## Coefficients:
##           ar1      ar2     ar3      ar4     ma1     ma2      ma3    mean
##       -0.2547  -0.4445  0.3578  -0.0995  0.6029  0.5490  -0.3260  0.8924
## s.e.   0.3520   0.2285  0.2290   0.0782  0.3492  0.3354   0.3007  0.0249
## 
## sigma^2 estimated as 0.1451:  log likelihood=-162.27
## AIC=342.54   AICc=343.04   BIC=377.66
## 
## Training set error measures:
##                         ME      RMSE       MAE  MPE MAPE     MASE
## Training set -0.0003557683 0.3766821 0.2882868 -Inf  Inf 0.790288
##                      ACF1
## Training set -0.004423992
forecast(m_aa, h = h) %>% 
  autoplot() + 
  theme_minimal()

Clearly, this model is of no use for forecasting over a long period. Let’s add a regression term, the mean daily pageviews:

tr %>% 
  group_by(date) %>% 
  summarize(mean_pv = log1p(mean(pageviews, na.rm=TRUE))) %>% 
  ungroup() %$% 
  mean_pv ->
  mean_pv_tr

te %>% 
  group_by(date) %>% 
  summarize(mean_pv = log1p(mean(pageviews, na.rm=TRUE))) %>% 
  ungroup() %$% 
  mean_pv ->
  mean_pv_te
m_aa_reg <- auto.arima(revenue, xreg = mean_pv_tr)
summary(m_aa_reg)
## Series: revenue 
## Regression with ARIMA(3,1,2) errors 
## 
## Coefficients:
##          ar1      ar2      ar3      ma1     ma2    xreg
##       0.7110  -0.2203  -0.0852  -1.3928  0.4779  2.4881
## s.e.  0.2007   0.0800   0.0688   0.1959  0.1750  0.3198
## 
## sigma^2 estimated as 0.1216:  log likelihood=-131.05
## AIC=276.11   AICc=276.42   BIC=303.41
## 
## Training set error measures:
##                      ME     RMSE       MAE  MPE MAPE      MASE
## Training set 0.01178035 0.345297 0.2658876 -Inf  Inf 0.7288843
##                      ACF1
## Training set -0.004276301
forecast(m_aa_reg, h = h, xreg = mean_pv_te) %>% 
  autoplot() + 
  theme_minimal()

This forecast looks much better. It is possible to use it in a more complex model.

3.13 Time features

The dataset contains the timestamp column visitStartTime, expressed as POSIX time, which allows us to create a bunch of features. Let’s check the ‘symmetric differences’ of the time features between the train and test sets.

tr_vst <- as_datetime(tr$visitStartTime)
te_vst <- as_datetime(te$visitStartTime)

symdiff <- function(x, y) setdiff(union(x, y), intersect(x, y))

Year:

symdiff(tr_vst %>% year %>% unique, te_vst %>% year %>% unique)
## [1] 2016 2018

Month:

symdiff(tr_vst %>% month %>% unique, te_vst %>% month %>% unique)
## [1] 6 7

Day:

symdiff(tr_vst %>% day %>% unique, te_vst %>% day %>% unique)
## integer(0)

Week:

symdiff(tr_vst %>% week %>% unique, te_vst %>% week %>% unique)
##  [1] 25 24 26 20 22 27 19 28 23 30 21 29

Day of the year:

symdiff(tr_vst %>% yday %>% unique, te_vst %>% yday %>% unique)
##  [1] 174 175 164 165 172 173 122 162 163 182 181 139 140 176 150 149 136
## [18] 135 213 168 169 167 366 186 185 131 132 179 178 166 195 196 157 158
## [35] 205 206 143 142 123 124 159 171 210 209 204 145 146 138 141 187 188
## [52] 203 202 126 127 152 153 134 128 129 170 161 160 156 125 184 192 193
## [69] 183 148 154 155 144 208 207 201 130 189 133 137 194 200 212 190 151
## [86] 177 211 147 191 199 197 198 180

Hour:

symdiff(tr_vst %>% hour %>% unique, te_vst %>% hour %>% unique)
## integer(0)

We can see that some time features (week, month, day of the year) differ notably between the train and test sets. Thus they can cause overfitting, while year and hour can be useful.
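Keeping only the stable features might look like this (a sketch; the preprocessing for the basic models below does something similar):

```r
library(lubridate)

# One visitStartTime value from the first row of the train set
vst <- as_datetime(1472830385)

year(vst)  # 2016
hour(vst)  # 15 (UTC)
wday(vst)  # day of the week is also shared between the two sets
```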

3.14 Distribution of visits and revenue by attributes

Summary:

  • The most frequent channels are Organic Search and Social.
  • Chrome is the most popular browser, and its users produce the highest total revenue.
  • Windows and macOS are the most popular operating systems. It’s interesting that Chrome OS users yield the highest mean revenue.
  • Desktops are still in the ranks.
  • US users yield most of the total revenue.
  • Usually the network domain is unknown.
  • organic and referral are the most popular mediums.

3.14.1 channelGrouping

3.14.2 browser

3.14.3 operatingSystem

3.14.4 deviceCategory

3.14.5 country

3.14.6 city

3.14.7 networkDomain

3.14.8 medium

4 Alluvial diagram

This kind of plot is useful for discovering multi-feature interactions.

The next figure shows the flows for the case when revenue > 0:

Non-zero transaction revenue is mainly yielded by the flow US-desktop-Chrome-{OrganicSearch | Referral}-net.

5 Bot or not?

6 World map: important values

7 Correlations between revenue and features

Some features are categorical, so we one-hot encode them (with a reduced set of levels). The ID columns are dropped.

m <- tr %>% 
  dplyr::mutate(year = year(date),
               month = month(date),
               day = day(date),
               isMobile = ifelse(isMobile, 1L, 0L),
               isTrueDirect = ifelse(isTrueDirect, 1L, 0L)) %>% 
  mutate_all(funs(ifelse(is.na(.), 0, .))) %>% 
  select(-date, -fullVisitorId, -visitId, -sessionId) %>% 
  mutate_if(is.character, factor) %>% 
  mutate_if(is.factor, fct_lump, prop = 0.01) %>% 
  model.matrix(~ . - 1, .) %>% 
  cor(y) %>% 
  data.table::as.data.table(keep.rownames=TRUE) %>% 
  purrr::set_names("Feature", "rho") %>% 
  arrange(-rho) 

m %>% 
  ggplot(aes(x = rho)) +
  geom_histogram(bins = 50, fill="steelblue") + 
  labs(x = "correlation") +
  theme_minimal()

The values of the correlation coefficient are concentrated around zero, but there are several values bigger than 0.1:

m %>% 
  filter(rho > 0.1) %>% 
  kable()
Feature rho
pageviews 0.1555894
hits 0.1543326

Let’s visualize the relationship of the target variable with each of the correlated variables.

Here we observe a weak positive relationship. Still, these features can play an important role in a statistical model.

8 Simple autoencoder

To train an autoencoder we can use the h2o package:

h2o.no_progress()
h2o.init(nthreads = 4, max_mem_size = "10G")
## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\ANDY~1.YUA\AppData\Local\Temp\Rtmpq6o9O9/h2o_andy_yuan_started_from_r.out
##     C:\Users\ANDY~1.YUA\AppData\Local\Temp\Rtmpq6o9O9/h2o_andy_yuan_started_from_r.err
## 
## 
## Starting H2O JVM and connecting: . Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         7 seconds 118 milliseconds 
##     H2O cluster timezone:       Asia/Singapore 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.20.0.8 
##     H2O cluster version age:    1 month and 29 days  
##     H2O cluster name:           H2O_started_from_R_andy.yuan_xgj592 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   8.89 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.5.1 (2018-07-02)
tr_h2o <- as.h2o(tr)
te_h2o <- as.h2o(te)

n_ae <- 4

Let’s train a simple model, which compresses the input space to 4 components:

m_ae <- h2o.deeplearning(training_frame = tr_h2o,
                         x = 1:ncol(tr_h2o),
                         autoencoder = T,
                         activation="Rectifier",
                         reproducible = TRUE,
                         seed = 0,
                         sparse = T,
                         standardize = TRUE,
                         hidden = c(32, n_ae, 32),
                         max_w2 = 5,
                         epochs = 25)
tr_ae <- h2o.deepfeatures(m_ae, tr_h2o, layer = 2) %>% as_tibble
te_ae <- h2o.deepfeatures(m_ae, te_h2o, layer = 2) %>% as_tibble

rm(tr_h2o, te_h2o, m_ae); invisible(gc())
h2o.shutdown(prompt = FALSE)
## [1] TRUE

In the figures we can observe a two-class separation. This information can be useful for statistical models.

9 Linear Mixed Model

In this section I’d like to present a linear mixed model (LMM). The data set in this competition is hierarchical: it contains session-level entries, but we have to predict aggregated, user-level values. We can treat rows with identical fullVisitorId as repeated measurements, so the grouping factor here is fullVisitorId. For the LMM we will use the most important variables identified by the XGB model. There are several packages for LMM in R; here I use lme4.

tri <- 1:nrow(tr)
data <- tr %>% 
  bind_cols(tibble(revenue = y)) %>% 
  bind_rows(te) %>% 
  select(revenue, pageviews, hits, visitNumber, fullVisitorId) %>% 
  mutate_each(funs(as.numeric(.) %>% log1p), -fullVisitorId) %>% 
  mutate(pageviews = ifelse(is.na(pageviews), 0, pageviews))

  
m_lmm0 <- lmer(revenue ~ (1|fullVisitorId), data = data[tri, ])

bg_var <- summary(m_lmm0)$varcor$fullVisitorId[1]
resid_var <- attr(summary(m_lmm0)$varcor, "sc")^2

summary(m_lmm0)
## Linear mixed model fit by REML ['lmerMod']
## Formula: revenue ~ (1 | fullVisitorId)
##    Data: data[tri, ]
## 
## REML criterion at convergence: 3810968
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.1182 -0.1005 -0.1005 -0.0950 11.7280 
## 
## Random effects:
##  Groups        Name        Variance Std.Dev.
##  fullVisitorId (Intercept) 0.2302   0.4798  
##  Residual                  3.7493   1.9363  
## Number of obs: 903653, groups:  fullVisitorId, 714167
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 0.206497   0.002154   95.86

The between-group variance is estimated as 0.2302456. The total variance equals 3.9795885. Thus, the intraclass correlation coefficient (ICC) equals 0.058. Such a low ICC shows that rows within the same group don’t resemble each other much.
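The ICC quoted above comes straight from the two variance components (values copied from the summary output):

```r
# Variance components from the random-effects table above
bg_var    <- 0.2302  # between-group (fullVisitorId) variance
resid_var <- 3.7493  # residual variance

icc <- bg_var / (bg_var + resid_var)
round(icc, 3)  # 0.058
```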

Let’s create more complex models:

m_lmm1 <- update(m_lmm0, revenue ~ pageviews + (1|fullVisitorId))
m_lmm2 <- update(m_lmm0, revenue ~ pageviews + hits + (1|fullVisitorId))
m_lmm3 <- update(m_lmm0, revenue ~ pageviews + hits + visitNumber + (1|fullVisitorId))

anova(m_lmm0, m_lmm1, m_lmm2, m_lmm3)
## Data: data[tri, ]
## Models:
## m_lmm0: revenue ~ (1 | fullVisitorId)
## m_lmm1: revenue ~ pageviews + (1 | fullVisitorId)
## m_lmm2: revenue ~ pageviews + hits + (1 | fullVisitorId)
## m_lmm3: revenue ~ pageviews + hits + visitNumber + (1 | fullVisitorId)
##        Df     AIC     BIC   logLik deviance   Chisq Chi Df Pr(>Chisq)    
## m_lmm0  3 3810963 3810999 -1905479  3810957                              
## m_lmm1  4 3718035 3718082 -1859013  3718027 92930.7      1  < 2.2e-16 ***
## m_lmm2  5 3716490 3716549 -1858240  3716480  1546.7      1  < 2.2e-16 ***
## m_lmm3  6 3712719 3712789 -1856353  3712707  3773.5      1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
pred_lmm <- predict(m_lmm3)

rm(data, m_lmm0, m_lmm1, m_lmm2, m_lmm3); invisible(gc())

The most complex model has the lowest AIC and we will use it for predictions.

10 Basic models

It is always useful at the beginning of a competition to create several simple models and compare them. For all models we use one preprocessed dataset:

grp_mean <- function(x, grp) ave(x, grp, FUN = function(x) mean(x, na.rm = TRUE))

idx <- tr$date < ymd("20170701")
id <- te[, "fullVisitorId"]
tri <- 1:nrow(tr)

tr_te <- tr %>%
  bind_rows(te) %>% 
  mutate(year = year(date) %>% factor(),
         wday = wday(date) %>% factor(),
         hour = hour(as_datetime(visitStartTime)) %>% factor(),
         isMobile = ifelse(isMobile, 1L, 0L),
         isTrueDirect = ifelse(isTrueDirect, 1L, 0L),
         adwordsClickInfo.isVideoAd = ifelse(!adwordsClickInfo.isVideoAd, 0L, 1L)) %>% 
  select(-date, -fullVisitorId, -visitId, -sessionId, -hits, -visitStartTime) %>% 
  mutate_if(is.character, factor) %>% 
  mutate(pageviews_mean_vn = grp_mean(pageviews, visitNumber),
         pageviews_mean_country = grp_mean(pageviews, country),
         pageviews_mean_city = grp_mean(pageviews, city),
         pageviews_mean_dom = grp_mean(pageviews, networkDomain),
         pageviews_mean_ref = grp_mean(pageviews, referralPath)) %T>% 
  glimpse()
## Observations: 1,708,337
## Variables: 36
## $ channelGrouping                <fct> Organic Search, Organic Search,...
## $ visitNumber                    <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1...
## $ browser                        <fct> Chrome, Firefox, Chrome, UC Bro...
## $ operatingSystem                <fct> Windows, Macintosh, Windows, Li...
## $ isMobile                       <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1...
## $ deviceCategory                 <fct> desktop, desktop, desktop, desk...
## $ continent                      <fct> Asia, Oceania, Europe, Asia, Eu...
## $ subContinent                   <fct> Western Asia, Australasia, Sout...
## $ country                        <fct> Turkey, Australia, Spain, Indon...
## $ region                         <fct> Izmir, NA, Community of Madrid,...
## $ metro                          <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ city                           <fct> Izmir, NA, Madrid, NA, NA, NA, ...
## $ networkDomain                  <fct> ttnet.com.tr, dodo.net.au, NA, ...
## $ campaign                       <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ source                         <fct> google, google, google, google,...
## $ medium                         <fct> organic, organic, organic, orga...
## $ keyword                        <fct> NA, NA, NA, google + online, NA...
## $ isTrueDirect                   <int> NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ referralPath                   <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adContent                      <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.page          <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.slot          <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.gclId         <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.adNetworkType <fct> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.isVideoAd     <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ pageviews                      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ bounces                        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ newVisits                      <int> 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, ...
## $ year                           <fct> 2016, 2016, 2016, 2016, 2016, 2...
## $ wday                           <fct> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hour                           <fct> 15, 5, 1, 5, 13, 9, 11, 10, 8, ...
## $ pageviews_mean_vn              <dbl> 3.347945, 3.347945, 3.347945, 3...
## $ pageviews_mean_country         <dbl> 1.907835, 3.137253, 2.757053, 2...
## $ pageviews_mean_city            <dbl> 1.601546, 1.000000, 2.523963, 1...
## $ pageviews_mean_dom             <dbl> 1.728231, 2.856089, 1.000000, 1...
## $ pageviews_mean_ref             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
rm(tr, te, tr_ae, te_ae); invisible(gc())

To make a submission file we will use this function:

submit <- . %>% 
  as_tibble() %>% 
  set_names("y") %>% 
  mutate(y = ifelse(y < 0, 0, expm1(y))) %>% 
  bind_cols(id) %>% 
  group_by(fullVisitorId) %>% 
  summarise(y = log1p(sum(y))) %>% 
  right_join(
    read_csv("./data/sample_submission.csv"), 
    by = "fullVisitorId") %>% 
  mutate(PredictedLogRevenue = round(y, 5)) %>% 
  select(-y) %>% 
  write_csv(sub)

10.1 GLMNET

For the glmnet model we need a model matrix. We replace NA values with zeros and lump rare factor levels:

tr_te_ohe <- tr_te %>% 
  mutate_if(is.factor, fct_explicit_na) %>% 
  mutate_if(is.numeric, funs(ifelse(is.na(.), 0L, .))) %>% 
  mutate_if(is.factor, fct_lump, prop = 0.05) %>% 
  select(-adwordsClickInfo.isVideoAd) %>% 
  model.matrix(~.-1, .) %>% 
  scale() %>% 
  round(4)

X <- tr_te_ohe[tri, ]
X_test <- tr_te_ohe[-tri, ]
rm(tr_te_ohe); invisible(gc())

The next step is to create a cross-validated glmnet model (note that alpha = 0 corresponds to ridge regression, not the LASSO):

m_glm <- cv.glmnet(X, log1p(y), alpha = 0, family="gaussian", 
                   type.measure = "mse", nfolds = 5)

Finally, we create predictions of the glmnet model:

pred_glm_tr <- predict(m_glm, X, s = "lambda.min") %>% c()
pred_glm <- predict(m_glm, X_test, s = "lambda.min") %>% c()
sub <- "glmnet_gs.csv"
submit(pred_glm)

rm(m_glm); invisible(gc())

10.2 Keras

For a neural net we can use the same model matrix. Let’s create a simple sequential model:

m_nn <- keras_model_sequential() 
m_nn %>% 
  layer_dense(units = 256, activation = "relu", input_shape = ncol(X)) %>% 
  layer_dropout(rate = 0.5) %>% 
  layer_dense(units = 128, activation = "relu") %>%
  layer_dropout(rate = 0.25) %>%
  layer_dense(units = 1, activation = "linear")

Next, we compile the model with appropriate parameters:

m_nn %>% compile(loss = "mean_squared_error",
                 metrics = custom_metric("rmse", function(y_true, y_pred) 
                   k_sqrt(metric_mean_squared_error(y_true, y_pred))),
                 optimizer = optimizer_adadelta())

Then we train the model:

history <- m_nn %>% 
  fit(X, log1p(y), 
      epochs = 50, 
      batch_size = 128, 
      verbose = 0, 
      validation_split = 0.2,
      callbacks = callback_early_stopping(patience = 5))

And finally, predictions:

pred_nn_tr <- predict(m_nn, X) %>% c()
pred_nn <- predict(m_nn, X_test) %>% c()
sub <- "keras_gs.csv"
submit(pred_nn)

rm(m_nn, X, X_test); invisible(gc())

10.3 XGB

At last, we are ready to create a simple XGB model. First, we preprocess the dataset. We don’t need to worry about NA values, since XGB handles them natively:

tr_te_xgb <- tr_te %>% 
  mutate_if(is.factor, as.integer) %>% 
  glimpse()
## Observations: 1,708,337
## Variables: 36
## $ channelGrouping                <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ visitNumber                    <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1...
## $ browser                        <int> 39, 57, 39, 115, 39, 39, 39, 39...
## $ operatingSystem                <int> 21, 8, 21, 7, 1, 21, 21, 21, 21...
## $ isMobile                       <int> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1...
## $ deviceCategory                 <int> 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 3...
## $ continent                      <int> 3, 5, 4, 3, 4, 4, 3, 5, 4, 4, 3...
## $ subContinent                   <int> 21, 1, 19, 16, 13, 19, 18, 1, 2...
## $ country                        <int> 211, 13, 187, 95, 218, 101, 155...
## $ region                         <int> 193, NA, 99, NA, NA, NA, NA, 33...
## $ metro                          <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ city                           <int> 378, NA, 475, NA, NA, NA, NA, 1...
## $ networkDomain                  <int> 37454, 10098, NA, NA, NA, 12548...
## $ campaign                       <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ source                         <int> 207, 207, 207, 207, 207, 207, 2...
## $ medium                         <int> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4...
## $ keyword                        <int> NA, NA, NA, 967, NA, NA, NA, NA...
## $ isTrueDirect                   <int> NA, NA, NA, NA, 1, NA, NA, NA, ...
## $ referralPath                   <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adContent                      <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.page          <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.slot          <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.gclId         <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.adNetworkType <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ adwordsClickInfo.isVideoAd     <int> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ pageviews                      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ bounces                        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ newVisits                      <int> 1, 1, 1, 1, NA, 1, 1, 1, 1, 1, ...
## $ year                           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wday                           <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6...
## $ hour                           <int> 16, 6, 2, 6, 14, 10, 12, 11, 9,...
## $ pageviews_mean_vn              <dbl> 3.347945, 3.347945, 3.347945, 3...
## $ pageviews_mean_country         <dbl> 1.907835, 3.137253, 2.757053, 2...
## $ pageviews_mean_city            <dbl> 1.601546, 1.000000, 2.523963, 1...
## $ pageviews_mean_dom             <dbl> 1.728231, 2.856089, 1.000000, 1...
## $ pageviews_mean_ref             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
rm(tr_te); invisible(gc()) 

Second, we create the train, validation and test sets, using a time-based split:

dtest <- xgb.DMatrix(data = data.matrix(tr_te_xgb[-tri, ]))
tr_te_xgb <- tr_te_xgb[tri, ]
dtr <- xgb.DMatrix(data = data.matrix(tr_te_xgb[idx, ]), label = log1p(y[idx]))
dval <- xgb.DMatrix(data = data.matrix(tr_te_xgb[!idx, ]), label = log1p(y[!idx]))
dtrain <- xgb.DMatrix(data = data.matrix(tr_te_xgb), label = log1p(y))
cols <- colnames(tr_te_xgb)
rm(tr_te_xgb); invisible(gc())

The next step is to train the model:

p <- list(objective = "reg:linear",
          booster = "gbtree",
          eval_metric = "rmse",
          nthread = 4,
          eta = 0.05,
          max_depth = 7,
          min_child_weight = 5,
          gamma = 0,
          subsample = 0.8,
          colsample_bytree = 0.7,
          colsample_bylevel = 0.6,
          nrounds = 2000)

set.seed(0)
m_xgb <- xgb.train(p, dtr, p$nrounds, list(val = dval), print_every_n = 100, early_stopping_rounds = 100)
## [1]  val-rmse:2.099455 
## Will train until val_rmse hasn't improved in 100 rounds.
## 
## [101]    val-rmse:1.699214 
## [201]    val-rmse:1.699873 
## Stopping. Best iteration:
## [102]    val-rmse:1.699087
xgb.importance(cols, model = m_xgb) %>% 
  xgb.plot.importance(top_n = 25)

Finally, we make predictions:

pred_xgb_tr <- predict(m_xgb, dtrain)
pred_xgb <- predict(m_xgb, dtest) 
sub <- "xgb_gs.csv"
submit(pred_xgb)

rm(dtr, dtrain, dval, dtest, m_xgb); invisible(gc())

10.4 Distributions of predictions

Let’s compare predictions for the train set:
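A sketch of such a comparison, assuming the train-set prediction vectors pred_glm_tr, pred_nn_tr and pred_xgb_tr from the previous sections are still in memory:

```r
tibble(glmnet = pred_glm_tr, keras = pred_nn_tr, xgb = pred_xgb_tr) %>% 
  gather(model, prediction) %>% 
  ggplot(aes(prediction, fill = model)) +
  geom_density(alpha = 0.4) +
  theme_minimal()
```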

As we can see, the distributions of the predictions are quite different. The XGB model produces a narrower interval, which is closer to the true distribution.

We can also average the glmnet/Keras/XGB predictions: back-transform each to the original scale with expm1, average, then return to the log scale with log1p:

pred_avg <- log1p((expm1(pred_glm) + expm1(pred_nn) + expm1(pred_xgb)) / 3)
sub <- "avg_gs.csv"
submit(pred_avg)

The distributions of the predictions for the test set also differ considerably; nevertheless, after proper tuning of the models they can be useful for ensembling.